Logistics¶
Read Carefully
Students are encouraged to work in teams of 3 people.
Projects with smaller teams are allowed, in exceptional cases, but will not have better grades for this reason.
The quality of the project will dictate its grade, not the number of people working.
The project's solution should be uploaded to Moodle by the end of December 22nd, 2024 (the last day before the Christmas holidays).
Teams should upload a .zip file containing all the files necessary for project evaluation. Teams should be registered in Moodle and the zip file, uploaded by one of the group members, should be identified as AA202425nn.zip, where nn is the group number.
It is mandatory to produce a Jupyter notebook containing code and text/images/tables/etc. describing the solution and the results. Projects not delivered in this format will not be graded. You can use AA_202425_Project.ipynb as a template. In your .zip folder you should also include an HTML version of your notebook with all the outputs.
Decisions should be justified and results should be critically discussed.
Remember that your notebook should be as clear and organized as possible, that is, only the relevant code and experiments should be presented, not everything you tried that did not work (that can be discussed in the text, if relevant)!
Project solutions containing only code and outputs without discussions will achieve a maximum grade of 10 out of 20.
Tools¶
The team should use Python 3 and Jupyter Notebook, together with Scikit-learn, Orange3, or both.
Orange3 can be used through its programmatic version, by importing and using its packages as done with Scikit-learn, or through its workflow version.
It is up to the team to decide when to use Scikit-learn, Orange, or both.
In this context, your Jupyter notebook might have a mix of code, results, text explanations, workflow figures, etc.
If you use Orange workflows for some tasks, you should also deliver the workflow files. Your notebook should include figures of the workflows used, together with an overall explanation and specific descriptions of the options taken in each of their widgets.
You can use this notebook and the sections below as example.
Dataset¶
The dataset to be analysed is PetFinder_dataset.csv, made available together with this project description. This dataset, downloaded from Kaggle, contains selected and modified data from the following competition: PetFinder.my Adoption Prediction.
PetFinder.my has been Malaysia’s leading animal welfare platform since 2008, with a database of more than 150,000 animals. PetFinder collaborates closely with animal lovers, media, corporations, and global organizations to improve animal welfare. Animal adoption rates are strongly correlated with the metadata associated with their online profiles, such as descriptive text and photo characteristics. As one example, PetFinder is currently experimenting with a simple AI tool called the Cuteness Meter, which ranks how cute a pet is based on qualities present in its photos.
In this competition, data scientists are expected to develop machine learning approaches to predict the adoptability of pets, specifically, how quickly a pet is adopted. If successful, these approaches will be adapted into AI tools that will guide shelters and rescuers around the world on improving their pet profiles' appeal, reducing animal suffering and euthanization.
In this project, your team is supposed to use only tabular data (not Images or Image Metadata) and see how far you can go in predicting and understanding PetFinder.my adoptions. You should use both supervised and unsupervised learning to tackle 2 tasks:
- Task 1 (Supervised Learning) - Predicting Adoption and Adoption Speed
- Task 2 (Unsupervised Learning) - Characterizing Pets and their Adoption Speed
The PetFinder_dataset.csv your machine learning algorithms should learn from has 14,993 instances described by 24 data fields that you might use as categorical/numerical features, and corresponds to a modified version of the train.csv file made available for the competition (https://www.kaggle.com/c/petfinder-adoption-prediction/data). The target in the original Kaggle challenge is AdoptionSpeed.
File Descriptions¶
- PetFinder_dataset.csv - Tabular/text data for machine learning.
- breed_labels.csv - Contains Type and BreedName for each BreedID. Type 1 is dog, 2 is cat.
- color_labels.csv - Contains ColorName for each ColorID.
- state_labels.csv - Contains StateName for each StateID.
Data Fields¶
- PetID - Unique hash ID of pet profile
- Type - Type of animal (1 = Dog, 2 = Cat)
- AdoptionSpeed - Categorical speed of adoption. Lower is faster. This is the value to predict in the competition. See section below for more info.
- Name - Name of pet (Empty if not named)
- Age - Age of pet when listed, in months
- Breed1 - Primary breed of pet (see BreedLabels.csv for details)
- Breed2 - Secondary breed of pet, if pet is of mixed breed (Refer to BreedLabels dictionary)
- Gender - Gender of pet (1 = Male, 2 = Female, 3 = Mixed, if profile represents group of pets)
- Color1 - Color 1 of pet (see ColorLabel.csv for details)
- Color2 - Color 2 of pet (see ColorLabel.csv for details)
- Color3 - Color 3 of pet (see ColorLabel.csv for details)
- MaturitySize - Size at maturity (1 = Small, 2 = Medium, 3 = Large, 4 = Extra Large, 0 = Not Specified)
- FurLength - Fur length (1 = Short, 2 = Medium, 3 = Long, 0 = Not Specified)
- Vaccinated - Pet has been vaccinated (1 = Yes, 2 = No, 3 = Not Sure)
- Dewormed - Pet has been dewormed (1 = Yes, 2 = No, 3 = Not Sure)
- Sterilized - Pet has been spayed / neutered (1 = Yes, 2 = No, 3 = Not Sure)
- Health - Health Condition (1 = Healthy, 2 = Minor Injury, 3 = Serious Injury, 0 = Not Specified)
- Quantity - Number of pets represented in profile
- Fee - Adoption fee (0 = Free)
- State - State location in Malaysia (Refer to StateLabels dictionary)
- RescuerID - Unique hash ID of rescuer
- VideoAmt - Total uploaded videos for this pet
- PhotoAmt - Total uploaded photos for this pet
- Description - Profile write-up for this pet. The primary language used is English, with some in Malay or Chinese.
AdoptionSpeed¶
The value of AdoptionSpeed describes how quickly, if at all, a pet is adopted:
- 0 - Pet was adopted on the same day as it was listed.
- 1 - Pet was adopted between 1 and 7 days (1st week) after being listed.
- 2 - Pet was adopted between 8 and 30 days (1st month) after being listed.
- 3 - Pet was adopted between 31 and 90 days (2nd & 3rd month) after being listed.
- 4 - No adoption after 100 days of being listed. (There are no pets in this dataset that waited between 90 and 100 days).
Important Notes on Data Cleaning and Preprocessing¶
- Data can contain errors/typos, whose correction might improve the analysis.
- Some features can contain many values, whose grouping in categories (aggregation into bins) might improve the analysis.
- Data can contain missing values that you might decide to fill. You might also decide to eliminate instances/features with high percentages of missing values.
- Not all features are necessarily important for the analysis.
- Depending on the analysis, some features might have to be excluded.
- Class distribution is an important characteristic of the dataset that should be carefully taken into consideration. Class imbalance might impair machine learning.
Some potentially useful links:
- Data Cleaning and Preprocessing in Scikit-learn: https://scikit-learn.org/stable/modules/preprocessing.html#
- Data Cleaning and Preprocessing in Orange: https://docs.biolab.si//3/visual-programming/widgets/data/preprocess.html
- Dealing with imbalanced datasets: https://pypi.org/project/imbalanced-learn/ and https://www.kaggle.com/rafjaa/resampling-strategies-for-imbalanced-datasets#t7
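To make these notes concrete, the sketch below illustrates, with plain pandas, the kind of cleaning steps listed above (inspecting missing values, binning a many-valued feature, dropping identifiers and checking the class distribution). The bins, dropped columns and the df_raw name are illustrative assumptions, not the preprocessing actually used later in this notebook.
import pandas as pd
# Illustrative cleaning sketch; column names come from the data dictionary above
df_raw = pd.read_csv('PetFinder_dataset.csv')
# Inspect missing values per column
print(df_raw.isna().sum().sort_values(ascending=False).head())
# Example of aggregating a many-valued feature into bins: Age (in months)
df_raw['AgeGroup'] = pd.cut(df_raw['Age'], bins=[-1, 3, 12, 36, float('inf')],
                            labels=['<=3m', '4-12m', '1-3y', '>3y'])
# Example of dropping identifier columns that carry little predictive value
df_raw = df_raw.drop(columns=['PetID', 'RescuerID'])
# Check the class distribution of the target (relevant for imbalance)
print(df_raw['AdoptionSpeed'].value_counts(normalize=True))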
Task 0 (Know your Data) - Exploratory Data Analysis¶
In this section we aim to better understand the data - including features and their distribution - and to preprocess it for further use.
0.1. Loading Data¶
import numpy as np
import pandas as pd
from LoadingData import *
table_X, table_y, features, target_name, df = load_data('PetFinder_dataset.csv')

0.2. Understanding Data¶
The first step in this project was to understand the data and how different variables relate to the target variable, adoption speed. To do that, several plots were produced.
The first one shows the relation between the type of animal and the adoption speed. We were interested in finding out whether the type of the animal (cat or dog) affects the speed at which the animal is adopted. With this plot it is possible to observe that dogs tend to have a higher adoption speed.
# How the type of the animal influences the adoption speed
import matplotlib.pyplot as plt
%matplotlib inline
plt.figure(figsize=(8, 6))
# Plot dogs and cats separately (Type encoding per the data dictionary: 1 = Dog, 2 = Cat)
for pet_type, color, label in zip([1, 2], ['#377eb8', '#ff7f00'], ['Dog', 'Cat']):
subset = df[df['Type'] == pet_type]
counts = subset['AdoptionSpeed'].value_counts().sort_index()
plt.bar(counts.index - 0.2 if pet_type == 1 else counts.index + 0.2, counts, width=0.4, label=label, color=color)
# Customization
plt.title("AdoptionSpeed vs. Type", fontsize=14)
plt.xlabel("Adoption Speed", fontsize=12)
plt.ylabel("Count", fontsize=12)
plt.ylim(0,2000)
plt.xticks([0, 1, 2, 3, 4], ['0', "1", "2", "3", "4"])
plt.legend(title="Type")
plt.tight_layout()
plt.show()
The next plot shows the relation between the vaccination status of the animal and the adoption speed. It is possible to verify that animals that are not vaccinated have a higher adoption speed than those that are vaccinated or whose vaccination status is not known.
# How the vaccination status of the animal influences the adoption speed
plt.figure(figsize=(8, 6))
# Plot vaccinated, not vaccinated and not sure separately
for pet_vaccination, color, label, x_offset in zip([1, 2, 3], ['#377eb8', '#ff7f00', '#4daf4a' ], ['Vaccinated', 'Not Vaccinated', 'Not sure'], [-0.2, 0, 0.2]):
subset = df[df['Vaccinated'] == pet_vaccination]
counts = subset['AdoptionSpeed'].value_counts().sort_index()
plt.bar(counts.index + x_offset, counts, width=0.2, label=label, color=color)
# Customization
plt.title("AdoptionSpeed vs. Vaccination", fontsize=14)
plt.xlabel("Adoption Speed", fontsize=12)
plt.ylabel("Count", fontsize=12)
plt.xticks([0, 1, 2, 3, 4], ['0', "1", "2", "3", '4'])
plt.legend(title="Vaccination Status")
plt.tight_layout()
plt.show()
Finally, the last plot represents the relation between the gender of the animal and the adoption speed; it is possible to conclude that female animals are adopted more quickly than the others (Male and Other).
# How the Gender of the animal influences the adoption speed
plt.figure(figsize=(8, 6))
# Plot male, female and other separately
for pet_gender, color, label, x_offset in zip([1, 2, 3], [ '#377eb8','#f781bf', '#4daf4a'], ['Male', 'Female', 'Other'], [-0.2, 0, 0.2]):
subset = df[df['Gender'] == pet_gender]
counts = subset['AdoptionSpeed'].value_counts().sort_index()
plt.bar(counts.index + x_offset, counts, width=0.2, label=label, color=color)
# Customization
plt.title("AdoptionSpeed vs. Gender", fontsize=14)
plt.xlabel("Adoption Speed", fontsize=12)
plt.ylabel("Count", fontsize=12)
plt.ylim(0,2000)
plt.xticks([0, 1, 2, 3, 4], ['0', "1", "2", "3", '4'])
plt.legend(title="Gender")
plt.tight_layout()
plt.show()
0.3. Preprocessing Data¶
The preprocessing was performed in the module "LoadingData.py". The following modifications were made:
- Removed Description
- Removed Name
- Removed PetID
- Removed RescuerID
- Removed Breed2
- Removed Color3
- Removed VideoAmt
Through extensive analysis, it was observed that certain features had missing values in some rows or contained numerous zeros that did not provide meaningful information.
Description and Name were removed because they have many missing values and are purely textual information. The IDs were removed because they are always or almost always unique, providing little predictive value. Breed2, Color3, and VideoAmt were removed because they contained many zeros.
For supervised learning to predict adoption, the target "AdoptionSpeed" was converted into a binary feature: it was set to 1 if the animal was adopted, that is, if "AdoptionSpeed" was equal to 0, 1, 2 or 3, and to 0 if it was equal to 4. Thus, 0 corresponds to the animal not being adopted and 1 to the animal being adopted.
For a better understanding of the data and of how the models would adjust to each type of animal, we divided the dataframe into two dataframes according to the animal type. This is also done in the module "LoadingData.py".
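The module "LoadingData.py" is not reproduced in this notebook; the following sketch shows, assuming plain pandas operations, how the removals, the binary target and the per-type dataframes described above could be obtained (the actual helper functions may differ in detail).
import pandas as pd
df_clean = pd.read_csv('PetFinder_dataset.csv')
# Drop the columns listed above (done inside LoadingData.py in this project)
df_clean = df_clean.drop(columns=['Description', 'Name', 'PetID', 'RescuerID',
                                  'Breed2', 'Color3', 'VideoAmt'])
# Binary target: 1 = adopted (AdoptionSpeed 0-3), 0 = not adopted (AdoptionSpeed 4)
df_clean['Adopted'] = (df_clean['AdoptionSpeed'] < 4).astype(int)
# Split by animal type for the specialized models (1 = Dog, 2 = Cat per the data dictionary)
df_dogs = df_clean[df_clean['Type'] == 1].copy()
df_cats = df_clean[df_clean['Type'] == 2].copy()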
Task 1 (Supervised Learning) - Predicting Adoption and Adoption Speed¶
In this task we will be performing 3 classification tasks:
Predicting Adoption (binary classification): a new target "Adopted" was created that considers a pet adopted if AdoptionSpeed is between 0 and 3, and not adopted if AdoptionSpeed is 4. These outcomes were encoded as 1 (adopted) and 0 (not adopted), respectively.
Predicting AdoptionSpeed (multiclass classification): the original target "AdoptionSpeed" was used, whose values are in the set {0, 1, 2, 3, 4} (5 classes).
Train specialized models for cats and dogs: The aim of this classification task was to check whether the classification performance improves when Predicting Adoption and Predicting AdoptionSpeed using a model that was trained with only cat/dog instances.
1.1. Specific Data Preprocessing for Classification¶
from LoadingData import *
from smote import *
table_X, table_y_Adopted, features_Adopted, target_name_Adopted, df_Adopted = loadDataAdopted(df)
# Per-type datasets (AdoptionSpeed target)
table_X_Dogs, table_y_Dogs_Speed, features_Dogs, target_Name_Dogs, df_Dogs = loadDataAnimalType(df,2)
table_X_Cats, table_y_Cats_Speed, features_Cats, target_Name_Cats, df_Cats = loadDataAnimalType(df,1)
# Per-type datasets (binary Adopted target)
table_X_Cats_Adopted, table_y_Cats_Adopted, features_Cats_Adopted, target_Name_Cats_Adopted, df_Cats_Adopted = loadDataAdopted(df_Cats)
table_X_Dogs_Adopted, table_y_Dogs_Adopted, features_Dogs_Adopted, target_Name_Dogs_Adopted, df_Dogs_Adopted = loadDataAdopted(df_Dogs)
X_smote, y_smote,df1 = smoteadopted(table_X,table_y_Adopted,features_Adopted)
1.2. Learning and Evaluating Classifiers¶
All models are in a file called "Models.py" to facilitate and better organize this work.
To deal with the class imbalance, we oversampled the minority class using the Synthetic Minority Oversampling Technique (SMOTE). This technique consists of creating synthetic samples of the minority class to balance the dataset, in an attempt to improve the performance of models that may otherwise be biased toward the majority class.
Undersampling of the majority class was avoided so as not to lose relevant data. The minority class also has enough samples for us to be confident that it is representative and that the synthetic samples generated from it are unlikely to cause overfitting.
SMOTE was applied to KNN and Logistic Regression since they are known to be sensitive to class imbalance. Random Forest, Decision Trees and SVM are not as sensitive and therefore did not have SMOTE applied.
Cross-validation was another technique used to evaluate the performance of the models and to help ensure that they generalize well to unseen data, avoiding overfitting and underfitting. The data was split into 10 folds, so that each model could be exposed to different training and test sets. The reported cross-validation accuracy is the average of the values obtained over the 10 folds.
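The modules "smote.py" and "Models.py" are not shown here. As a reference, the sketch below uses imbalanced-learn (linked in the preprocessing notes) to combine SMOTE with stratified 10-fold cross-validation, resampling only the training folds; this is one common way to set it up and not necessarily the exact procedure implemented in those modules.
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
# Resample only inside each training fold so the test folds keep the original class distribution
pipeline = Pipeline([
    ('smote', SMOTE(random_state=0)),
    ('knn', KNeighborsClassifier(n_neighbors=3)),
])
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(pipeline, table_X, table_y_Adopted, cv=cv, scoring='accuracy')
print(f"CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")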
1.2.1 Predicting Adoption¶
To predict the adoption of the animal we selected 5 models: Decision Tree, Naive Bayes, K-Nearest Neighbors (KNN), Logistic Regression and Random Forest.
Accuracy was chosen as the evaluation metric since the dataset is only mildly imbalanced (roughly 75/25), so the results are still representative for a significant portion of the dataset. Moreover, as this relates to a pet adoption problem whose solution would potentially be used in shelters, we considered that the impact of false negatives or false positives is of little consequence.
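Since roughly three in four instances belong to the adopted class, a useful reference for these accuracies is a trivial majority-class predictor; the sketch below estimates that baseline with scikit-learn's DummyClassifier (an illustration, not part of "Models.py").
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score
# A classifier that always predicts the majority class ("adopted")
baseline = DummyClassifier(strategy='most_frequent')
baseline_scores = cross_val_score(baseline, table_X, table_y_Adopted, cv=10, scoring='accuracy')
print(f"Majority-class baseline accuracy: {baseline_scores.mean():.3f}")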
Decision Tree¶
The Decision Tree was the first one to be implemented.
Below is a representation of the tree model, the accuracies obtained for both the training and test sets, and the accuracy obtained after performing cross-validation.
For this tree, a maximum of 10 leaves was selected, as increasing the number further led to overfitting. Choosing 10 leaves was optimal since going beyond this number did not improve accuracy. By limiting the number of leaves, we created a faster and more efficient classifier without compromising performance.
from Models import *
%matplotlib inline
OurTree(table_X,table_y_Adopted,10,features_Adopted)
Accuracy on training set: 0.7556229731143425 Accuracy on test set: 0.7514904298713524 Fold: 1, Class dist.: [2980 8491], Acc: 0.751 Fold: 2, Class dist.: [2980 8491], Acc: 0.758 Fold: 3, Class dist.: [2980 8491], Acc: 0.756 Fold: 4, Class dist.: [2980 8491], Acc: 0.751 Fold: 5, Class dist.: [2980 8491], Acc: 0.758 Fold: 6, Class dist.: [2979 8492], Acc: 0.769 Fold: 7, Class dist.: [2980 8492], Acc: 0.753 Fold: 8, Class dist.: [2980 8492], Acc: 0.757 Fold: 9, Class dist.: [2980 8492], Acc: 0.759 Fold: 10, Class dist.: [2980 8492], Acc: 0.739 CV accuracy: 0.755 +/- 0.007
The training and test accuracies show similar values, which indicates that the model is generalizing well to data it has not seen before. We consider an accuracy of 75.5% to be a good performance.
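The choice of 10 leaves can be checked with a quick sweep over the maximum number of leaves. Since the internals of OurTree are not shown, this sketch assumes a plain scikit-learn DecisionTreeClassifier and only illustrates the kind of comparison behind that choice.
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score
# Cross-validated accuracy for different numbers of leaves
for n_leaves in [5, 10, 20, 50, 100]:
    tree = DecisionTreeClassifier(max_leaf_nodes=n_leaves, random_state=0)
    leaf_scores = cross_val_score(tree, table_X, table_y_Adopted, cv=10)
    print(f"max_leaf_nodes={n_leaves:3d}: CV accuracy = {leaf_scores.mean():.3f}")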
Naive Bayes¶
Naive Bayes was the probabilistic algorithm used. SMOTE was not used for this model either, as it did not bring any advantage.
Below are two confusion matrices, one obtained from the training set and the other from the test set, together with the accuracies obtained for both sets and the accuracy obtained after performing cross-validation.
from Models import *
%matplotlib inline
naive(table_X,table_y_Adopted)
Accuracy on training set: 0.7192174913693901 Accuracy on test set: 0.7295262001882649 Fold: 1, Class dist.: [2980 8491], Acc: 0.716 Fold: 2, Class dist.: [2980 8491], Acc: 0.740 Fold: 3, Class dist.: [2980 8491], Acc: 0.706 Fold: 4, Class dist.: [2980 8491], Acc: 0.718 Fold: 5, Class dist.: [2980 8491], Acc: 0.732 Fold: 6, Class dist.: [2979 8492], Acc: 0.711 Fold: 7, Class dist.: [2980 8492], Acc: 0.710 Fold: 8, Class dist.: [2980 8492], Acc: 0.731 Fold: 9, Class dist.: [2980 8492], Acc: 0.728 Fold: 10, Class dist.: [2980 8492], Acc: 0.708 CV accuracy: 0.720 +/- 0.011
Overall, this model performs better at predicting animal adoptions than non-adoptions, as demonstrated in the confusion matrix above. This could be attributed to the larger number of animals in the adopted class, which reduces the impact of noise on the predictions. The training and test accuracies show similar values, which indicates that the model is generalizing well to data it has not seen before. Although the accuracy is lower than in the previous model, an accuracy of 72.0% still indicates a good performance.
K-Nearest Neighbors (KNN)¶
K-Nearest Neighbors (KNN) was the distance-based algorithm used. In this case the SMOTE method improved the accuracy, so it was used.
Below are, once again, the confusion matrices obtained from the training and test sets, together with the accuracies for both sets and the accuracy obtained after performing cross-validation.
We opted to use 3 neighbors for this analysis. While selecting only 1 neighbor could potentially boost overall performance by approximately 5%, we believe this approach would result in less reliable predictions. Using 3 neighbors provides a better trade-off between performance and predictive reliability. As the data shows, the test accuracy when using only 1 neighbor reaches 99%, therefore indicating overfitting.
from Models import *
%matplotlib inline
X_smote, y_smote,df1 = smoteadopted(table_X,table_y_Adopted,features_Adopted)
Ourknn(X_smote, y_smote,3)
Accuracy on training set: 0.8818800247371676 Accuracy on test set: 0.7717996289424861 Fold: 1, Class dist.: [6791 6791], Acc: 0.746 Fold: 2, Class dist.: [6791 6791], Acc: 0.728 Fold: 3, Class dist.: [6792 6791], Acc: 0.763 Fold: 4, Class dist.: [6792 6791], Acc: 0.777 Fold: 5, Class dist.: [6792 6791], Acc: 0.793 Fold: 6, Class dist.: [6792 6791], Acc: 0.825 Fold: 7, Class dist.: [6791 6792], Acc: 0.808 Fold: 8, Class dist.: [6791 6792], Acc: 0.807 Fold: 9, Class dist.: [6791 6792], Acc: 0.829 Fold: 10, Class dist.: [6791 6792], Acc: 0.799 CV accuracy: 0.788 +/- 0.032
This model predicts non-adopted animals more accurately. While it performs better at identifying non-adopted animals, the number of adopted animals incorrectly predicted as non-adopted remains similar. In comparison to Naive Bayes and Decision Tree, this model is not generalizing as well to data it has not seen before; however, it shows the highest accuracy so far (78.8%).
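The trade-off between k = 1 and k = 3 discussed above can be visualized with a small sweep over the number of neighbors. This is a sketch with scikit-learn directly (Ourknn's internals are not shown), using a single stratified train/test split of the SMOTE-resampled data.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
X_tr, X_te, y_tr, y_te = train_test_split(
    X_smote, y_smote, test_size=0.25, stratify=y_smote, random_state=0)
# Train/test accuracy for a few values of k; k = 1 essentially memorizes the training set
for k in [1, 3, 5, 11]:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    print(f"k={k:2d}: train acc = {knn.score(X_tr, y_tr):.3f}, test acc = {knn.score(X_te, y_te):.3f}")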
Logistic Regression¶
The linear model used was logistic regression, as it is a good model for binary classification. The SMOTE method was also used.
Below are some graphs representing the importance of each feature. This model has a parameter, $C$, that determines the strength of the regularization, so it was tested using different values for this parameter (0.001, 1 and 100). The accuracies for the training and test sets for each of these values, and the accuracy obtained with cross-validation, can also be observed.
from Models import *
logreg(X_smote, y_smote,df1,features_Adopted)
(15092, 17) Adopted 1.0 7546 0.0 7546 Name: count, dtype: int64 Fold: 1, Class dist.: [6791 6791], Acc: 0.682 Fold: 2, Class dist.: [6791 6791], Acc: 0.638 Fold: 3, Class dist.: [6792 6791], Acc: 0.661 Fold: 4, Class dist.: [6792 6791], Acc: 0.654 Fold: 5, Class dist.: [6792 6791], Acc: 0.661 Fold: 6, Class dist.: [6792 6791], Acc: 0.673 Fold: 7, Class dist.: [6791 6792], Acc: 0.651 Fold: 8, Class dist.: [6791 6792], Acc: 0.644 Fold: 9, Class dist.: [6791 6792], Acc: 0.674 Fold: 10, Class dist.: [6791 6792], Acc: 0.654 CV accuracy: 0.659 +/- 0.013 Train set score (Accuracy)= 0.6609238924649754 Test set score (Accuracy)= 0.6581272084805654 Train set score (Accuracy)= 0.6603559257856872 Test set score (Accuracy)= 0.6576855123674912 Train set score (Accuracy)= 0.659503975766755 Test set score (Accuracy)= 0.6687279151943463
Train accuracy of L1 logreg with C=0.001 = 0.64 Test accuracy of L1 logreg with C=0.001 = 0.64 Train accuracy of L1 logreg with C=1.000 = 0.66 Test accuracy of L1 logreg with C=1.000 = 0.66
C:\Users\Filip\anaconda3\Lib\site-packages\sklearn\svm\_base.py:1242: ConvergenceWarning: Liblinear failed to converge, increase the number of iterations. warnings.warn(
Train accuracy of L1 logreg with C=100.000 = 0.66 Test accuracy of L1 logreg with C=100.000 = 0.66
Although the training and test accuracies show similar values, this model obtained the lowest accuracy so far (66%). There are also no significant differences between the several regularization strengths used (0.001, 1 and 100).
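For reference, the regularization experiment can be reproduced directly with scikit-learn's LogisticRegression; the sketch below is an assumption about how logreg in "Models.py" might be set up, varying C with an L1 penalty as in the output above, and standardizing the features, which also helps avoid the ConvergenceWarning shown.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
X_tr, X_te, y_tr, y_te = train_test_split(
    X_smote, y_smote, test_size=0.25, stratify=y_smote, random_state=0)
# Standardize features so the liblinear solver converges more easily
scaler = StandardScaler().fit(X_tr)
X_tr_s, X_te_s = scaler.transform(X_tr), scaler.transform(X_te)
for C in [0.001, 1.0, 100.0]:
    lr = LogisticRegression(penalty='l1', C=C, solver='liblinear', max_iter=1000).fit(X_tr_s, y_tr)
    print(f"C={C:>7}: train acc = {lr.score(X_tr_s, y_tr):.2f}, test acc = {lr.score(X_te_s, y_te):.2f}")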
Random Forest¶
Random Forest, an ensemble of decision trees, was also implemented.
Below it can be observed a representation of the confusion matrix obtained, the accuracies for both the train set and the test set and the accuracy after performing the Cross-Validation.
Two methodologies were tested with the Random Forest: one with hyperparameter tuning and one without. Overall, both approaches yielded similar performance.
Even without applying SMOTE, the cross-validation accuracy was around 70%, likely due to the substantial amount of data in the majority class and the low density of the minority class.
However, when SMOTE was applied alongside hyperparameter tuning, the cross-validation accuracy improved significantly, to 82%, potentially because the model could learn more accurate decision boundaries. As a result, the combination of SMOTE and hyperparameter tuning was adopted for better predictive performance.
from Models import *
RandomF(X_smote, y_smote)
Accuracy on training set: 0.8748122625673647 Accuracy on test set: 0.8359395706334481
Fold: 1, Class dist.: [6791 6791], Acc: 0.639 Fold: 2, Class dist.: [6791 6791], Acc: 0.618 Fold: 3, Class dist.: [6792 6791], Acc: 0.646 Fold: 4, Class dist.: [6792 6791], Acc: 0.797 Fold: 5, Class dist.: [6792 6791], Acc: 0.920 Fold: 6, Class dist.: [6792 6791], Acc: 0.924 Fold: 7, Class dist.: [6791 6792], Acc: 0.922 Fold: 8, Class dist.: [6791 6792], Acc: 0.908 Fold: 9, Class dist.: [6791 6792], Acc: 0.940 Fold: 10, Class dist.: [6791 6792], Acc: 0.908 CV accuracy: 0.822 +/- 0.129
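The hyperparameter tuning mentioned above is not detailed in this excerpt. One way to carry it out, shown here as a sketch with illustrative parameter ranges (not necessarily those explored in "Models.py"), is a grid search over a few RandomForestClassifier settings on the SMOTE-resampled data.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
# Illustrative grid; the ranges actually explored may differ
param_grid = {
    'n_estimators': [100, 300],
    'max_depth': [None, 10, 20],
    'min_samples_leaf': [1, 5],
}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=5, scoring='accuracy', n_jobs=-1)
search.fit(X_smote, y_smote)
print("Best parameters:", search.best_params_)
print(f"Best CV accuracy: {search.best_score_:.3f}")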
1.2.2 Predicting AdoptionSpeed¶
To predict the adoption speed of the animal we also selected 5 models: Decision Tree, Naive Bayes, K-Nearest Neighbors (KNN), Support Vector Machine (SVM) and Random Forest.
Even though results are generally better when the target has fewer classes, we decided to use all 5 classes for these predictions.
Decision Tree¶
When predicting the adoption speed, the Decision Tree was again the first model to be implemented. The SMOTE method did not bring any advantage, so it was not applied.
Below is a representation of the tree model, the accuracies obtained for both the training and test sets, and the accuracy obtained after performing cross-validation. The selected maximum number of leaves in the decision tree was retained for the same reasons outlined previously.
from Models import *
%matplotlib inline
OurTree(table_X, table_y,20,features)
Accuracy on training set: 0.37995606234961815 Accuracy on test set: 0.356448070285535 Fold: 1, Class dist.: [ 316 2440 3221 2514 2980], Acc: 0.354 Fold: 2, Class dist.: [ 316 2440 3221 2514 2980], Acc: 0.372 Fold: 3, Class dist.: [ 316 2440 3221 2514 2980], Acc: 0.361 Fold: 4, Class dist.: [ 316 2440 3221 2514 2980], Acc: 0.362 Fold: 5, Class dist.: [ 316 2439 3221 2515 2980], Acc: 0.365 Fold: 6, Class dist.: [ 316 2440 3221 2515 2979], Acc: 0.369 Fold: 7, Class dist.: [ 316 2440 3221 2515 2980], Acc: 0.352 Fold: 8, Class dist.: [ 316 2440 3221 2515 2980], Acc: 0.372 Fold: 9, Class dist.: [ 316 2440 3221 2515 2980], Acc: 0.352 Fold: 10, Class dist.: [ 315 2440 3222 2515 2980], Acc: 0.342 CV accuracy: 0.360 +/- 0.009
Since the training and test accuracies show similar values, it indicates that the model is not memorizing the training set. However, both accuracies are quite low.
Naive Bayes¶
Naive Bayes was the probabilistic algorithm used. Below are the confusion matrices obtained from the training and test sets, together with the accuracies for both sets and the accuracy obtained after performing cross-validation.
from Models import *
%matplotlib inline
naive(table_X,table_y)
Accuracy on training set: 0.3490950936290407 Accuracy on test set: 0.34860370254157513 Fold: 1, Class dist.: [ 316 2440 3221 2514 2980], Acc: 0.368 Fold: 2, Class dist.: [ 316 2440 3221 2514 2980], Acc: 0.356 Fold: 3, Class dist.: [ 316 2440 3221 2514 2980], Acc: 0.338 Fold: 4, Class dist.: [ 316 2440 3221 2514 2980], Acc: 0.341 Fold: 5, Class dist.: [ 316 2439 3221 2515 2980], Acc: 0.340 Fold: 6, Class dist.: [ 316 2440 3221 2515 2979], Acc: 0.352 Fold: 7, Class dist.: [ 316 2440 3221 2515 2980], Acc: 0.347 Fold: 8, Class dist.: [ 316 2440 3221 2515 2980], Acc: 0.347 Fold: 9, Class dist.: [ 316 2440 3221 2515 2980], Acc: 0.356 Fold: 10, Class dist.: [ 315 2440 3222 2515 2980], Acc: 0.330 CV accuracy: 0.348 +/- 0.010
K-Nearest Neighbors (KNN)¶
The number of neighbors was kept at 3 because, although smaller values could yield better apparent performance, they might also result in worse outcomes in other respects, such as overfitting.
from Models import *
%matplotlib inline
X_smote, y_smote,df1 = smoteadoptionspeed(table_X,table_y,features)
Ourknn(X_smote, y_smote,3)
Accuracy on training set: 0.6974068835454974 Accuracy on test set: 0.4724186704384724 Fold: 1, Class dist.: [2546 2545 2545 2545 2545], Acc: 0.458 Fold: 2, Class dist.: [2546 2545 2545 2545 2545], Acc: 0.464 Fold: 3, Class dist.: [2545 2545 2545 2546 2545], Acc: 0.479 Fold: 4, Class dist.: [2545 2545 2545 2546 2545], Acc: 0.472 Fold: 5, Class dist.: [2545 2545 2545 2545 2546], Acc: 0.483 Fold: 6, Class dist.: [2545 2545 2545 2545 2546], Acc: 0.455 Fold: 7, Class dist.: [2545 2545 2546 2545 2545], Acc: 0.460 Fold: 8, Class dist.: [2545 2545 2546 2545 2545], Acc: 0.458 Fold: 9, Class dist.: [2545 2546 2545 2545 2545], Acc: 0.588 Fold: 10, Class dist.: [2545 2546 2545 2545 2545], Acc: 0.630 CV accuracy: 0.495 +/- 0.059
The gap between the training and test accuracies suggests that the model might be overfitting.
Support Vector Machine (SVM)¶
The linear model used was the Support Vector Machine, as it is a better model for dealing with multiclass classification. The SMOTE method was also used. Below are the accuracies for the training and test sets, the accuracy obtained with cross-validation, and some of the coefficients and the intercept obtained.
from Models import *
svm(table_X,table_y)
Fold: 1, Class dist.: [ 316 2440 3221 2514 2980], Acc: 0.346 Fold: 2, Class dist.: [ 316 2440 3221 2514 2980], Acc: 0.357 Fold: 3, Class dist.: [ 316 2440 3221 2514 2980], Acc: 0.333 Fold: 4, Class dist.: [ 316 2440 3221 2514 2980], Acc: 0.360 Fold: 5, Class dist.: [ 316 2439 3221 2515 2980], Acc: 0.344 Fold: 6, Class dist.: [ 316 2440 3221 2515 2979], Acc: 0.353 Fold: 7, Class dist.: [ 316 2440 3221 2515 2980], Acc: 0.358 Fold: 8, Class dist.: [ 316 2440 3221 2515 2980], Acc: 0.363 Fold: 9, Class dist.: [ 316 2440 3221 2515 2980], Acc: 0.356 Fold: 10, Class dist.: [ 315 2440 3222 2515 2980], Acc: 0.340 CV accuracy: 0.351 +/- 0.009 Training set score (Accuracy) = 0.34721205146981904 Test set score (Accuracy) = 0.35707561970505175 ------------------------------------------------------------------------------------------ LinearSVC coefficients and intercept: Coeficients (w) = [[ 7.83809898e-06 8.33508544e-06 -2.87790682e-04 -3.26342841e-06 8.20653586e-06 1.08270637e-05 -3.60710476e-06 1.00720256e-05 4.25356570e-06 5.35123525e-06 4.19781688e-06 5.09375045e-07 -8.05500142e-06 -3.74462946e-05 -2.09434799e-05 -3.17319348e-05] [ 2.56615961e-05 -2.63347940e-04 -1.13213616e-03 -1.91031185e-05 4.51061832e-05 3.95016127e-05 -1.10863363e-05 2.16473154e-05 2.24342478e-05 1.18687157e-05 2.14380418e-05 -1.23812142e-06 -4.36277045e-05 2.42666816e-05 -6.41633510e-06 -4.47705804e-05] [-2.28606850e-07 -1.51550734e-04 1.47097558e-04 -9.75704454e-06 -1.95939254e-06 6.05786823e-06 5.74039267e-06 -2.74652539e-08 3.65764078e-06 -5.14964913e-06 1.54960199e-06 -1.44156682e-06 -6.04014490e-06 -4.86031022e-05 -1.15380660e-05 4.45133056e-05] [-2.76209581e-05 -1.27473557e-04 -5.19431452e-04 2.00894240e-05 -2.54542662e-05 -8.30633723e-06 1.08882801e-05 -1.66911920e-05 -2.54914157e-05 -2.66017420e-05 -2.20940019e-05 -4.97987744e-07 -1.15622954e-05 -2.39747800e-04 -1.00676790e-05 3.50999656e-04] [-1.88852723e-03 5.30655156e-02 2.29272028e-03 5.26858619e-03 -7.80421841e-03 -5.24509489e-03 -1.59007336e-03 -3.59996356e-03 -1.11900935e-03 2.59163550e-03 -3.95055723e-04 5.67413469e-04 1.50875301e-02 -1.29181731e-06 -2.98093268e-05 -2.38886120e-02]] Intercept (b) = [-3.44259016e-09 -4.34800533e-09 3.71453007e-09 3.12048471e-08 -4.27018836e-06]
Random Forest¶
Random Forest was another tree-based model implemented, and another case where the SMOTE method did not bring any advantage. Below are the confusion matrix obtained, the accuracies for both the training and test sets, and the accuracy after performing cross-validation.
from Models import *
RandomF(table_X,table_y)
Accuracy on training set: 0.5467099068940265 Accuracy on test set: 0.36774395983683716
Fold: 1, Class dist.: [ 316 2440 3221 2514 2980], Acc: 0.369 Fold: 2, Class dist.: [ 316 2440 3221 2514 2980], Acc: 0.373 Fold: 3, Class dist.: [ 316 2440 3221 2514 2980], Acc: 0.365 Fold: 4, Class dist.: [ 316 2440 3221 2514 2980], Acc: 0.378 Fold: 5, Class dist.: [ 316 2439 3221 2515 2980], Acc: 0.374 Fold: 6, Class dist.: [ 316 2440 3221 2515 2979], Acc: 0.387 Fold: 7, Class dist.: [ 316 2440 3221 2515 2980], Acc: 0.363 Fold: 8, Class dist.: [ 316 2440 3221 2515 2980], Acc: 0.370 Fold: 9, Class dist.: [ 316 2440 3221 2515 2980], Acc: 0.367 Fold: 10, Class dist.: [ 315 2440 3222 2515 2980], Acc: 0.341 CV accuracy: 0.369 +/- 0.011
Dogs¶
Tree Model¶
OurTree(table_X_Dogs, table_y_Dogs_Speed,20,features_Dogs)
Accuracy on training set: 0.3840255591054313 Accuracy on test set: 0.34099616858237547 Fold: 1, Class dist.: [ 201 1404 1581 1081 1367], Acc: 0.360 Fold: 2, Class dist.: [ 202 1404 1581 1082 1366], Acc: 0.356 Fold: 3, Class dist.: [ 202 1404 1581 1082 1366], Acc: 0.356 Fold: 4, Class dist.: [ 202 1404 1581 1082 1366], Acc: 0.358 Fold: 5, Class dist.: [ 202 1404 1581 1082 1366], Acc: 0.366 Fold: 6, Class dist.: [ 202 1404 1581 1082 1366], Acc: 0.347 Fold: 7, Class dist.: [ 202 1404 1581 1082 1366], Acc: 0.329 Fold: 8, Class dist.: [ 201 1404 1582 1082 1366], Acc: 0.359 Fold: 9, Class dist.: [ 201 1404 1582 1082 1366], Acc: 0.364 Fold: 10, Class dist.: [ 201 1404 1582 1081 1367], Acc: 0.296 CV accuracy: 0.349 +/- 0.020
Naive Bayes¶
naive(table_X_Dogs, table_y_Dogs_Speed)
Accuracy on training set: 0.3226837060702875 Accuracy on test set: 0.3275862068965517 Fold: 1, Class dist.: [ 201 1404 1581 1081 1367], Acc: 0.329 Fold: 2, Class dist.: [ 202 1404 1581 1082 1366], Acc: 0.332 Fold: 3, Class dist.: [ 202 1404 1581 1082 1366], Acc: 0.326 Fold: 4, Class dist.: [ 202 1404 1581 1082 1366], Acc: 0.335 Fold: 5, Class dist.: [ 202 1404 1581 1082 1366], Acc: 0.312 Fold: 6, Class dist.: [ 202 1404 1581 1082 1366], Acc: 0.351 Fold: 7, Class dist.: [ 202 1404 1581 1082 1366], Acc: 0.324 Fold: 8, Class dist.: [ 201 1404 1582 1082 1366], Acc: 0.310 Fold: 9, Class dist.: [ 201 1404 1582 1082 1366], Acc: 0.316 Fold: 10, Class dist.: [ 201 1404 1582 1081 1367], Acc: 0.312 CV accuracy: 0.325 +/- 0.012
K-Nearest Neighbors (KNN)¶
X_smote, y_smote,df1 = smoteadoptionspeed(table_X_Dogs, table_y_Dogs_Speed,features)
Ourknn(X_smote, y_smote,15)
Accuracy on training set: 0.47716594625070474 Accuracy on test set: 0.3900789177001127 Fold: 1, Class dist.: [1277 1277 1277 1277 1277], Acc: 0.400 Fold: 2, Class dist.: [1277 1277 1277 1277 1277], Acc: 0.417 Fold: 3, Class dist.: [1277 1277 1277 1277 1277], Acc: 0.387 Fold: 4, Class dist.: [1277 1277 1277 1277 1277], Acc: 0.400 Fold: 5, Class dist.: [1277 1277 1277 1277 1277], Acc: 0.385 Fold: 6, Class dist.: [1277 1277 1277 1278 1277], Acc: 0.406 Fold: 7, Class dist.: [1278 1277 1277 1277 1277], Acc: 0.381 Fold: 8, Class dist.: [1277 1278 1277 1277 1277], Acc: 0.426 Fold: 9, Class dist.: [1277 1277 1278 1277 1277], Acc: 0.454 Fold: 10, Class dist.: [1277 1277 1277 1277 1278], Acc: 0.463 CV accuracy: 0.412 +/- 0.027
Support Vector Machine (SVM)¶
from Models import *
svm(table_X_Dogs, table_y_Dogs_Speed)
Fold: 1, Class dist.: [ 201 1404 1581 1081 1367], Acc: 0.351 Fold: 2, Class dist.: [ 202 1404 1581 1082 1366], Acc: 0.340 Fold: 3, Class dist.: [ 202 1404 1581 1082 1366], Acc: 0.323 Fold: 4, Class dist.: [ 202 1404 1581 1082 1366], Acc: 0.351 Fold: 5, Class dist.: [ 202 1404 1581 1082 1366], Acc: 0.339 Fold: 6, Class dist.: [ 202 1404 1581 1082 1366], Acc: 0.329 Fold: 7, Class dist.: [ 202 1404 1581 1082 1366], Acc: 0.355 Fold: 8, Class dist.: [ 201 1404 1582 1082 1366], Acc: 0.343 Fold: 9, Class dist.: [ 201 1404 1582 1082 1366], Acc: 0.351 Fold: 10, Class dist.: [ 201 1404 1582 1081 1367], Acc: 0.337 CV accuracy: 0.342 +/- 0.010 Training set score (Accuracy) = 0.34930777422790205 Test set score (Accuracy) = 0.3397190293742018 ------------------------------------------------------------------------------------------ LinearSVC coefficients and intercept: Coeficients (w) = [[-1.22416050e-08 -1.34356196e-04 7.86186622e-04 -2.15750683e-05 4.61157641e-05 1.60582951e-04 -8.14339073e-06 1.77672776e-04 7.02812609e-05 8.65329397e-05 4.91440056e-05 8.49401365e-06 -1.26833622e-04 -1.28990629e-04 -2.74932261e-05 -3.98255522e-04] [ 1.66541256e-06 -2.82153854e-02 8.38782648e-04 -1.06608157e-03 4.12978226e-03 -2.27556058e-04 -3.69437785e-04 2.47661144e-03 2.36941090e-03 1.41147346e-03 2.21644910e-03 -1.81429512e-04 -4.39763437e-03 -2.51547640e-04 -1.45168123e-05 -2.36588781e-03] [ 1.59178174e-08 -2.88712089e-04 5.61677806e-05 -3.30079148e-05 3.79255081e-05 -1.66476627e-05 4.87207969e-06 -1.40517164e-05 -1.56449842e-05 -3.12143469e-05 -6.81148507e-06 -2.85236030e-06 -1.48666151e-05 4.08900523e-04 -1.11197778e-05 1.58838249e-04] [ 2.10055221e-08 -3.35551925e-05 -1.20144651e-04 1.02999530e-05 -1.68050506e-05 4.56244270e-05 1.12442128e-06 -1.02557487e-05 -1.29454699e-05 -2.06220297e-05 -8.01328851e-06 1.39869556e-06 -5.80312350e-06 -1.71011482e-04 -1.40732744e-05 2.20595811e-04] [-1.62399517e-05 5.06338416e-02 -2.72591614e-03 9.43713727e-03 -1.91296711e-02 -5.76416419e-03 -4.20529594e-03 -7.08743922e-03 6.78749446e-04 7.42946004e-03 -5.17382953e-04 3.78668032e-04 2.68022584e-02 1.58670659e-04 2.81364446e-06 -2.84530322e-02]] Intercept (b) = [-6.12080252e-09 8.32706279e-07 7.95890870e-09 1.05027611e-08 -8.11997584e-06]
Random Forest¶
RandomF(table_X_Dogs, table_y_Dogs_Speed)
Accuracy on training set: 0.5727369542066028 Accuracy on test set: 0.3314176245210728
Fold: 1, Class dist.: [ 201 1404 1581 1081 1367], Acc: 0.346 Fold: 2, Class dist.: [ 202 1404 1581 1082 1366], Acc: 0.379 Fold: 3, Class dist.: [ 202 1404 1581 1082 1366], Acc: 0.318 Fold: 4, Class dist.: [ 202 1404 1581 1082 1366], Acc: 0.382 Fold: 5, Class dist.: [ 202 1404 1581 1082 1366], Acc: 0.324 Fold: 6, Class dist.: [ 202 1404 1581 1082 1366], Acc: 0.350 Fold: 7, Class dist.: [ 202 1404 1581 1082 1366], Acc: 0.335 Fold: 8, Class dist.: [ 201 1404 1582 1082 1366], Acc: 0.356 Fold: 9, Class dist.: [ 201 1404 1582 1082 1366], Acc: 0.318 Fold: 10, Class dist.: [ 201 1404 1582 1081 1367], Acc: 0.319 CV accuracy: 0.343 +/- 0.023
Cats¶
With the exception of the KNN model, the accuracy was marginally higher when trained specifically on cat data.
Tree Model¶
OurTree(table_X_Cats, table_y_Cats_Speed,60,features_Cats)
Accuracy on training set: 0.45794776886695454 Accuracy on test set: 0.3871763255240444 Fold: 1, Class dist.: [ 114 1036 1640 1432 1614], Acc: 0.387 Fold: 2, Class dist.: [ 114 1036 1640 1432 1614], Acc: 0.405 Fold: 3, Class dist.: [ 114 1036 1639 1433 1614], Acc: 0.391 Fold: 4, Class dist.: [ 114 1036 1639 1433 1614], Acc: 0.385 Fold: 5, Class dist.: [ 114 1035 1640 1433 1614], Acc: 0.407 Fold: 6, Class dist.: [ 115 1036 1640 1433 1613], Acc: 0.387 Fold: 7, Class dist.: [ 115 1036 1640 1433 1613], Acc: 0.387 Fold: 8, Class dist.: [ 115 1036 1640 1433 1613], Acc: 0.403 Fold: 9, Class dist.: [ 114 1036 1640 1433 1614], Acc: 0.392 Fold: 10, Class dist.: [ 114 1036 1640 1433 1614], Acc: 0.355 CV accuracy: 0.390 +/- 0.014
Naive Bayes¶
naive(table_X_Cats, table_y_Cats_Speed)
Accuracy on training set: 0.36829117828500924 Accuracy on test set: 0.3563501849568434 Fold: 1, Class dist.: [ 114 1036 1640 1432 1614], Acc: 0.362 Fold: 2, Class dist.: [ 114 1036 1640 1432 1614], Acc: 0.390 Fold: 3, Class dist.: [ 114 1036 1639 1433 1614], Acc: 0.367 Fold: 4, Class dist.: [ 114 1036 1639 1433 1614], Acc: 0.348 Fold: 5, Class dist.: [ 114 1035 1640 1433 1614], Acc: 0.374 Fold: 6, Class dist.: [ 115 1036 1640 1433 1613], Acc: 0.355 Fold: 7, Class dist.: [ 115 1036 1640 1433 1613], Acc: 0.352 Fold: 8, Class dist.: [ 115 1036 1640 1433 1613], Acc: 0.346 Fold: 9, Class dist.: [ 114 1036 1640 1433 1614], Acc: 0.380 Fold: 10, Class dist.: [ 114 1036 1640 1433 1614], Acc: 0.329 CV accuracy: 0.360 +/- 0.017
K-Nearest Neighbors (KNN)¶
X_smote, y_smote,df1 = smoteadoptionspeed(table_X_Cats, table_y_Cats_Speed,features)
Ourknn(X_smote, y_smote,7)
Accuracy on training set: 0.575941230486685 Accuracy on test set: 0.47107438016528924 Fold: 1, Class dist.: [1307 1307 1306 1307 1307], Acc: 0.450 Fold: 2, Class dist.: [1307 1307 1306 1307 1307], Acc: 0.410 Fold: 3, Class dist.: [1307 1307 1307 1306 1307], Acc: 0.448 Fold: 4, Class dist.: [1307 1307 1307 1306 1307], Acc: 0.442 Fold: 5, Class dist.: [1306 1307 1307 1307 1307], Acc: 0.481 Fold: 6, Class dist.: [1306 1307 1307 1307 1307], Acc: 0.457 Fold: 7, Class dist.: [1307 1307 1307 1307 1306], Acc: 0.507 Fold: 8, Class dist.: [1307 1307 1307 1307 1306], Acc: 0.507 Fold: 9, Class dist.: [1307 1306 1307 1307 1307], Acc: 0.507 Fold: 10, Class dist.: [1307 1306 1307 1307 1307], Acc: 0.541 CV accuracy: 0.475 +/- 0.038
Random Forest¶
RandomF(table_X_Cats, table_y_Cats_Speed)
Accuracy on training set: 0.5395846185482213 Accuracy on test set: 0.40998766954377314
Fold: 1, Class dist.: [ 114 1036 1640 1432 1614], Acc: 0.401 Fold: 2, Class dist.: [ 114 1036 1640 1432 1614], Acc: 0.405 Fold: 3, Class dist.: [ 114 1036 1639 1433 1614], Acc: 0.396 Fold: 4, Class dist.: [ 114 1036 1639 1433 1614], Acc: 0.357 Fold: 5, Class dist.: [ 114 1035 1640 1433 1614], Acc: 0.422 Fold: 6, Class dist.: [ 115 1036 1640 1433 1613], Acc: 0.403 Fold: 7, Class dist.: [ 115 1036 1640 1433 1613], Acc: 0.383 Fold: 8, Class dist.: [ 115 1036 1640 1433 1613], Acc: 0.367 Fold: 9, Class dist.: [ 114 1036 1640 1433 1614], Acc: 0.418 Fold: 10, Class dist.: [ 114 1036 1640 1433 1614], Acc: 0.350 CV accuracy: 0.390 +/- 0.024
Support Vector Machine¶
svm(table_X_Cats, table_y_Cats_Speed)
Fold: 1, Class dist.: [ 114 1036 1640 1432 1614], Acc: 0.348 Fold: 2, Class dist.: [ 114 1036 1640 1432 1614], Acc: 0.390 Fold: 3, Class dist.: [ 114 1036 1639 1433 1614], Acc: 0.368 Fold: 4, Class dist.: [ 114 1036 1639 1433 1614], Acc: 0.374 Fold: 5, Class dist.: [ 114 1035 1640 1433 1614], Acc: 0.388 Fold: 6, Class dist.: [ 115 1036 1640 1433 1613], Acc: 0.380 Fold: 7, Class dist.: [ 115 1036 1640 1433 1613], Acc: 0.383 Fold: 8, Class dist.: [ 115 1036 1640 1433 1613], Acc: 0.394 Fold: 9, Class dist.: [ 114 1036 1640 1433 1614], Acc: 0.392 Fold: 10, Class dist.: [ 114 1036 1640 1433 1614], Acc: 0.346 CV accuracy: 0.376 +/- 0.016 Training set score (Accuracy) = 0.3802179724449928 Test set score (Accuracy) = 0.3723797780517879 ------------------------------------------------------------------------------------------ LinearSVC coefficients and intercept: Coeficients (w) = [[-1.04201362e-09 1.27213265e-05 -2.91347484e-04 -2.44616858e-06 2.83742724e-06 -7.04570169e-06 -1.44049270e-06 1.62119180e-06 -4.05516918e-07 8.45631526e-07 1.60099993e-06 4.22711687e-08 -2.19728504e-06 1.63707762e-05 -2.13030042e-05 -1.26915196e-05] [-3.39425624e-09 -1.19372211e-04 -1.08089254e-03 -1.85033133e-05 1.77176905e-05 1.24146077e-06 -1.57759737e-06 1.19385312e-05 7.73008662e-06 3.79555742e-06 1.10586715e-05 -6.62946849e-07 -2.82407465e-05 2.66218255e-04 -8.47915555e-06 -4.62646848e-05] [ 8.54994644e-09 -4.13246863e-04 4.62384527e-05 -1.36414910e-05 -4.35018487e-05 4.45842229e-05 2.05132027e-05 1.11513845e-05 3.00469261e-05 5.05590855e-06 1.31747957e-05 -3.85234770e-06 -1.27152980e-05 -2.15704504e-04 -1.07753684e-05 5.66758958e-05] [ 3.63637621e-08 -2.05031311e-04 -5.45870204e-04 2.88789249e-05 -4.65005108e-06 -8.80755771e-06 6.97760947e-06 -1.84243421e-05 -2.62858136e-05 -2.29642040e-05 -3.41450099e-05 -2.71040924e-06 -1.23099939e-05 -2.94367024e-04 -8.54394643e-06 4.07128114e-04] [-1.00080305e-05 5.03314166e-02 2.38883129e-03 1.39355875e-02 -1.10713064e-02 -5.69439717e-04 -5.16182207e-03 -7.76858041e-03 -1.99970714e-03 3.73530043e-03 2.63345487e-03 1.92951687e-03 3.86477354e-02 6.43374711e-05 -3.12065343e-05 -2.15073439e-02]] Intercept (b) = [-1.04201362e-09 -3.39425624e-09 8.54994644e-09 3.63637621e-08 -1.00080305e-05]
Predicting Adoption¶
Dogs¶
Tree Model¶
OurTree(table_X_Dogs_Adopted, table_y_Dogs_Adopted,20,features_Dogs)
Accuracy on training set: 0.7799787007454739 Accuracy on test set: 0.7535121328224776 Fold: 1, Class dist.: [1366 4268], Acc: 0.764 Fold: 2, Class dist.: [1367 4268], Acc: 0.748 Fold: 3, Class dist.: [1367 4268], Acc: 0.746 Fold: 4, Class dist.: [1366 4269], Acc: 0.748 Fold: 5, Class dist.: [1366 4269], Acc: 0.768 Fold: 6, Class dist.: [1366 4269], Acc: 0.751 Fold: 7, Class dist.: [1366 4269], Acc: 0.772 Fold: 8, Class dist.: [1366 4269], Acc: 0.744 Fold: 9, Class dist.: [1366 4269], Acc: 0.762 Fold: 10, Class dist.: [1366 4269], Acc: 0.751 CV accuracy: 0.755 +/- 0.010
Naive Bayes¶
naive(table_X_Dogs_Adopted, table_y_Dogs_Adopted)
Accuracy on training set: 0.7356762513312034 Accuracy on test set: 0.7113665389527458 Fold: 1, Class dist.: [1366 4268], Acc: 0.727 Fold: 2, Class dist.: [1367 4268], Acc: 0.727 Fold: 3, Class dist.: [1367 4268], Acc: 0.712 Fold: 4, Class dist.: [1366 4269], Acc: 0.724 Fold: 5, Class dist.: [1366 4269], Acc: 0.733 Fold: 6, Class dist.: [1366 4269], Acc: 0.716 Fold: 7, Class dist.: [1366 4269], Acc: 0.730 Fold: 8, Class dist.: [1366 4269], Acc: 0.735 Fold: 9, Class dist.: [1366 4269], Acc: 0.709 Fold: 10, Class dist.: [1366 4269], Acc: 0.725 CV accuracy: 0.724 +/- 0.008
K-Nearest Neighbors (KNN)¶
X_smote, y_smote,df1 = smoteadopted(table_X_Dogs_Adopted, table_y_Dogs_Adopted,features)
Ourknn(X_smote, y_smote,3)
Accuracy on training set: 0.8839128907622058 Accuracy on test set: 0.7771338250790305 Fold: 1, Class dist.: [3416 3416], Acc: 0.759 Fold: 2, Class dist.: [3416 3416], Acc: 0.792 Fold: 3, Class dist.: [3416 3417], Acc: 0.751 Fold: 4, Class dist.: [3416 3417], Acc: 0.829 Fold: 5, Class dist.: [3416 3417], Acc: 0.808 Fold: 6, Class dist.: [3416 3417], Acc: 0.808 Fold: 7, Class dist.: [3417 3416], Acc: 0.827 Fold: 8, Class dist.: [3417 3416], Acc: 0.819 Fold: 9, Class dist.: [3417 3416], Acc: 0.809 Fold: 10, Class dist.: [3417 3416], Acc: 0.814 CV accuracy: 0.802 +/- 0.025
Support Vector Machine¶
svm(table_X_Dogs_Adopted, table_y_Dogs_Adopted)
Fold: 1, Class dist.: [1366 4268], Acc: 0.745 Fold: 2, Class dist.: [1367 4268], Acc: 0.748 Fold: 3, Class dist.: [1367 4268], Acc: 0.759 Fold: 4, Class dist.: [1366 4269], Acc: 0.751 Fold: 5, Class dist.: [1366 4269], Acc: 0.746 Fold: 6, Class dist.: [1366 4269], Acc: 0.744 Fold: 7, Class dist.: [1366 4269], Acc: 0.752 Fold: 8, Class dist.: [1366 4269], Acc: 0.757 Fold: 9, Class dist.: [1366 4269], Acc: 0.748 Fold: 10, Class dist.: [1366 4269], Acc: 0.749 CV accuracy: 0.750 +/- 0.005 Training set score (Accuracy) = 0.7559105431309904 Test set score (Accuracy) = 0.7369093231162197 ------------------------------------------------------------------------------------------ LinearSVC coefficients and intercept: Coeficients (w) = [[ 1.66029634e-05 -5.06078236e-02 2.69648567e-03 -9.56303762e-03 1.93612794e-02 5.97777858e-03 4.32642784e-03 7.22599565e-03 -7.17715463e-04 -7.54557136e-03 5.17667854e-04 -3.86695506e-04 -2.71828312e-02 -1.62068346e-04 -2.63912997e-06 2.85086170e-02]] Intercept (b) = [8.30148172e-06]
Random Forest¶
X_smote, y_smote,df1 = smoteadopted(table_X_Dogs_Adopted, table_y_Dogs_Adopted,features)
RandomF(X_smote, y_smote)
Accuracy on training set: 0.8912890762205831 Accuracy on test set: 0.8414120126448894
Fold: 1, Class dist.: [3416 3416], Acc: 0.605 Fold: 2, Class dist.: [3416 3416], Acc: 0.625 Fold: 3, Class dist.: [3416 3417], Acc: 0.606 Fold: 4, Class dist.: [3416 3417], Acc: 0.885 Fold: 5, Class dist.: [3416 3417], Acc: 0.930 Fold: 6, Class dist.: [3416 3417], Acc: 0.928 Fold: 7, Class dist.: [3417 3416], Acc: 0.935 Fold: 8, Class dist.: [3417 3416], Acc: 0.928 Fold: 9, Class dist.: [3417 3416], Acc: 0.935 Fold: 10, Class dist.: [3417 3416], Acc: 0.946 CV accuracy: 0.832 +/- 0.145
Cats¶
Tree Model¶
OurTree(table_X_Cats_Adopted, table_y_Cats_Adopted,20,features_Cats)
Accuracy on training set: 0.7645486325313593 Accuracy on test set: 0.7552404438964242 Fold: 1, Class dist.: [1614 4222], Acc: 0.761 Fold: 2, Class dist.: [1614 4222], Acc: 0.763 Fold: 3, Class dist.: [1613 4223], Acc: 0.750 Fold: 4, Class dist.: [1613 4223], Acc: 0.747 Fold: 5, Class dist.: [1613 4223], Acc: 0.755 Fold: 6, Class dist.: [1614 4223], Acc: 0.764 Fold: 7, Class dist.: [1614 4223], Acc: 0.752 Fold: 8, Class dist.: [1614 4223], Acc: 0.761 Fold: 9, Class dist.: [1614 4223], Acc: 0.762 Fold: 10, Class dist.: [1614 4223], Acc: 0.727 CV accuracy: 0.754 +/- 0.011
Naive Bayes¶
naive(table_X_Cats_Adopted, table_y_Cats_Adopted)
Accuracy on training set: 0.7149907464528069 Accuracy on test set: 0.7114673242909988 Fold: 1, Class dist.: [1614 4222], Acc: 0.689 Fold: 2, Class dist.: [1614 4222], Acc: 0.737 Fold: 3, Class dist.: [1613 4223], Acc: 0.712 Fold: 4, Class dist.: [1613 4223], Acc: 0.700 Fold: 5, Class dist.: [1613 4223], Acc: 0.723 Fold: 6, Class dist.: [1614 4223], Acc: 0.711 Fold: 7, Class dist.: [1614 4223], Acc: 0.704 Fold: 8, Class dist.: [1614 4223], Acc: 0.718 Fold: 9, Class dist.: [1614 4223], Acc: 0.741 Fold: 10, Class dist.: [1614 4223], Acc: 0.704 CV accuracy: 0.714 +/- 0.015
K-Nearest Neighbors (KNN)¶
X_smote, y_smote,df1 = smoteadopted(table_X_Cats_Adopted, table_y_Cats_Adopted,features)
Ourknn(X_smote, y_smote,3)
Accuracy on training set: 0.8807061340941512 Accuracy on test set: 0.7561497326203208 Fold: 1, Class dist.: [3365 3365], Acc: 0.759 Fold: 2, Class dist.: [3365 3365], Acc: 0.737 Fold: 3, Class dist.: [3365 3365], Acc: 0.761 Fold: 4, Class dist.: [3365 3365], Acc: 0.758 Fold: 5, Class dist.: [3365 3365], Acc: 0.814 Fold: 6, Class dist.: [3365 3365], Acc: 0.824 Fold: 7, Class dist.: [3365 3365], Acc: 0.802 Fold: 8, Class dist.: [3365 3365], Acc: 0.807 Fold: 9, Class dist.: [3366 3365], Acc: 0.803 Fold: 10, Class dist.: [3365 3366], Acc: 0.807 CV accuracy: 0.787 +/- 0.029
Support Vector Machine¶
svm(table_X_Cats_Adopted, table_y_Cats_Adopted)
Fold: 1, Class dist.: [1614 4222], Acc: 0.737 Fold: 2, Class dist.: [1614 4222], Acc: 0.741 Fold: 3, Class dist.: [1613 4223], Acc: 0.729 Fold: 4, Class dist.: [1613 4223], Acc: 0.741 Fold: 5, Class dist.: [1613 4223], Acc: 0.733 Fold: 6, Class dist.: [1614 4223], Acc: 0.739 Fold: 7, Class dist.: [1614 4223], Acc: 0.715 Fold: 8, Class dist.: [1614 4223], Acc: 0.730 Fold: 9, Class dist.: [1614 4223], Acc: 0.739 Fold: 10, Class dist.: [1614 4223], Acc: 0.721 CV accuracy: 0.732 +/- 0.009 Training set score (Accuracy) = 0.7234217561176228 Test set score (Accuracy) = 0.7355117139334155 ------------------------------------------------------------------------------------------ LinearSVC coefficients and intercept: Coeficients (w) = [[ 1.60397922e-06 -3.17046092e-02 -2.12355930e-03 -1.60970449e-03 4.44005663e-04 5.34751741e-04 8.35971566e-04 1.04477003e-03 7.49589370e-04 -4.12631690e-04 2.44926785e-04 -3.27439410e-04 -4.38366944e-03 -7.95952693e-05 2.73017726e-05 1.11212790e-02]] Intercept (b) = [1.60397922e-06]
Random Forest¶
X_smote, y_smote,df1 = smoteadopted(table_X_Cats_Adopted, table_y_Cats_Adopted,features)
RandomF(X_smote, y_smote)
Accuracy on training set: 0.8610912981455064 Accuracy on test set: 0.8042780748663102
Fold: 1, Class dist.: [3365 3365], Acc: 0.659 Fold: 2, Class dist.: [3365 3365], Acc: 0.690 Fold: 3, Class dist.: [3365 3365], Acc: 0.650 Fold: 4, Class dist.: [3365 3365], Acc: 0.698 Fold: 5, Class dist.: [3365 3365], Acc: 0.898 Fold: 6, Class dist.: [3365 3365], Acc: 0.914 Fold: 7, Class dist.: [3365 3365], Acc: 0.902 Fold: 8, Class dist.: [3365 3365], Acc: 0.897 Fold: 9, Class dist.: [3366 3365], Acc: 0.902 Fold: 10, Class dist.: [3365 3366], Acc: 0.909 CV accuracy: 0.812 +/- 0.113
1.3. Classification - Final Discussion and Conclusions¶
The models generally demonstrated similar performance across the same tasks. As expected, we observed poorer performance in the multiclass prediction task compared to binary classification, given the increased complexity of the former. It is also evident that SMOTE contributed to improving the prediction of class 0.
It is also noteworthy that, after applying SMOTE, only some models exhibited improved performance. This could be attributed to the way each model operates. For instance, in the case of KNN, if synthetic data points were generated close to existing points, this could lead to improved predictions.
Overall, the accuracy for predicting adoption or non-adoption was approximately 80% for some models and around 75% for others, indicating consistent results. However, when predicting the speed of adoption, most models underperformed. This failure could be due to the imbalance between the number of features and the number of rows. In this analysis, 16 features were used against a dataset of 5,000 rows, representing a significant disparity that may compromise the viability of the multiclass predictions.
In general, when attempting to reduce the number of classes, the accuracy on AdoptionSpeed increases from around 30% to approximately 55%. This demonstrates that most classifiers are not particularly effective at distinguishing multiple classes; one reason for this can be the overlapping of points across classes, that is, instances with similar features but different classes.
In future analyses, utilizing one-hot encoding could potentially improve results by increasing the number of columns and, consequently, reducing noise-related issues. In the current encoding, noise in a single column could disproportionately impact results, contributing to the failure of the multiclass predictions.
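As a sketch of the one-hot encoding suggested above, pandas' get_dummies can expand the nominal fields into binary columns; the column subset below is only an illustrative assumption (and assumes these columns are still present in df).
import pandas as pd
# Nominal features whose integer codes have no natural order (illustrative subset)
nominal_cols = ['Breed1', 'Color1', 'Color2', 'State', 'Gender']
# Each nominal column is expanded into one binary column per category
df_encoded = pd.get_dummies(df, columns=nominal_cols, prefix=nominal_cols)
print(df_encoded.shape)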
Prediction Adoption
| Model | Training ACC | Test ACC | Cross Validation ACC |
|---|---|---|---|
| Tree | 0.755 | 0.751 | 0.755 |
| Naive Bayes | 0.719 | 0.729 | 0.720 |
| KNN | 0.881 | 0.771 | 0.788 |
| Logistic Regression | 0.660 | 0.658 | 0.659 |
| Random Forest | 0.874 | 0.835 | 0.822 |
Prediction Adoption Speed
| Model | Training ACC | Test ACC | Cross Validation ACC |
|---|---|---|---|
| Tree | 0.379 | 0.356 | 0.360 |
| Naive Bayes | 0.349 | 0.348 | 0.348 |
| KNN | 0.697 | 0.472 | 0.495 |
| SVM | 0.347 | 0.357 | 0.351 |
| Random Forest | 0.546 | 0.367 | 0.369 |
In general, when using the same classifiers to predict the adoption of cats and dogs separately, there were some differences. These may be due to greater similarity within the cat data, leading to better results in models that rely on distances or similarity, as can be seen for KNN and Random Forest. However, the differences observed are not significant, as they fall within the margin of error, and the results are essentially the same as in the prediction over all animals.
When predicting adoption, the values are very similar. This could be due to the lower number of classes, which makes the prediction easier and results in better performance.
Dogs¶
| Model | Training ACC | Test ACC | Cross Validation ACC |
|---|---|---|---|
| Tree | 0.384 | 0.340 | 0.349 |
| Naive Bayes | 0.322 | 0.327 | 0.325 |
| KNN | 0.477 | 0.390 | 0.412 |
| SVM | 0.349 | 0.339 | 0.342 |
| Random Forest | 0.572 | 0.331 | 0.343 |
Cats¶
| Model | Training ACC | Test ACC | Cross Validation ACC |
|---|---|---|---|
| Tree | 0.457 | 0.387 | 0.390 |
| Naive Bayes | 0.368 | 0.356 | 0.360 |
| KNN | 0.575 | 0.471 | 0.475 |
| SVM | 0.380 | 0.372 | 0.376 |
| Random Forest | 0.539 | 0.409 | 0.390 |
Prediction Adoption
Dogs¶
| Model | Training ACC | Test ACC | Cross Validation ACC |
|---|---|---|---|
| Tree | 0.779 | 0.753 | 0.755 |
| Naive Bayes | 0.735 | 0.711 | 0.724 |
| KNN | 0.883 | 0.777 | 0.802 |
| SVM | 0.755 | 0.736 | 0.750 |
| Random Forest | 0.891 | 0.841 | 0.832 |
Cats¶
| Model | Training ACC | Test ACC | Cross Validation ACC |
|---|---|---|---|
| Tree | 0.764 | 0.755 | 0.754 |
| Naive Bayes | 0.714 | 0.711 | 0.714 |
| KNN | 0.880 | 0.756 | 0.787 |
| SVM | 0.723 | 0.735 | 0.732 |
| Random Forest | 0.861 | 0.804 | 0.812 |
Task 2 (Unsupervised Learning) - Characterizing Pets and their Adoption Speed¶
In this task you should use unsupervised learning to characterize pets and their adoption speed. You have 2 clustering tasks:
- Use Clustering algorithms to find similar groups of adopted pets. When animals are adopted, is it possible to find groups of pets with the same/similar adoption speed? Evaluate clustering results using internal and external metrics.
- Be creative and define and explore your own unsupervised learning task! What else would it be interesting to find out?
2.1. Preprocessing Data for Clustering¶
To complete this part of the work, the dataset was first scaled. Given that K-Means relies on distance-based methods, it was essential to control the scale of the variables.
from LoadingData import *
from Models import *
table_X, table_y, features, target_name, df = load_data('PetFinder_dataset.csv')
%matplotlib inline
table_X_Scaled, table_y_Scaled, features_Scaled, target_name_Scaled, df_Scaled = loadScaledData(df)
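loadScaledData is defined in "LoadingData.py" and is not reproduced here; assuming standard z-score scaling, the same effect can be obtained with scikit-learn's StandardScaler, as in this sketch.
from sklearn.preprocessing import StandardScaler
# Bring all features to zero mean and unit variance so that no single feature
# dominates the Euclidean distances used by K-Means
scaler = StandardScaler()
X_scaled_sketch = scaler.fit_transform(table_X)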
2.2. Learning and Evaluating Clusterings¶
To explore potential similarities among the animals, clustering was performed using 2, 4, and 8 clusters. The aim was to determine whether the data could be effectively divided by factors such as the type of animal, adoption speed, or adoption speed based on the animal type.
The silhouette score evaluates how well each data point aligns with its assigned cluster. Scores range from -1 (worst) to 1 (best), with values near 0 indicating overlapping clusters.
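For reference, for a point $i$ with $a(i)$ the mean distance to the other points in its own cluster and $b(i)$ the mean distance to the points of the nearest neighbouring cluster, the silhouette value is

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}},$$

and the reported score is the average of $s(i)$ over all points.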
The ARI (Adjusted Rand Index) ranges from -1 (completely dissimilar) to 1 (perfect match), with 0 indicating random clustering.
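The clustering itself is performed by the helper OurKmeans from the team's Models module, called in the cells below. Since its implementation is not included in the notebook, the following is only a minimal sketch of what such a helper is assumed to do: fit K-Means and agglomerative clustering (HCA) on the scaled data and print both metrics.
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import silhouette_score, adjusted_rand_score
def kmeans_and_hca_scores(X, y, n_clusters):
    # Hypothetical stand-in for OurKmeans: fit both models and report internal/external metrics.
    km_labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(X)
    hca_labels = AgglomerativeClustering(n_clusters=n_clusters).fit_predict(X)
    for name, labels in [("Kmeans", km_labels), ("HCA", hca_labels)]:
        print(name, "silhouette_score:", silhouette_score(X, labels))            # internal metric
        print(name, "Adjusted Rand Index (ARI):", adjusted_rand_score(y, labels))  # external metric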
Clustering¶
i=2
print("Number of Clusters ", i ," \n")
OurKmeans(table_X_Scaled, table_y_Scaled,i)
Number of Clusters 2
Kmeans silhouette_score: 0.0980206208981116
Kmeans Adjusted Rand Index (ARI): 0.0031734913631197123
HCA silhouette_score: 0.12665142552829103
HCA Adjusted Rand Index (ARI): 0.0007490641101930682
i=4
print("Number of Clusters ", i ," \n")
OurKmeans(table_X_Scaled, table_y_Scaled,i)
Number of Clusters 4
Kmeans silhouette_score: 0.09880751324918044
Kmeans Adjusted Rand Index (ARI): 0.004793905554401137
HCA silhouette_score: 0.12048475375960427
HCA Adjusted Rand Index (ARI): 0.003456109079462546
i=8
print("Number of Clusters ", i ," \n")
OurKmeans(table_X_Scaled, table_y_Scaled,i)
Number of Clusters 8
Kmeans silhouette_score: 0.10453662380484792
Kmeans Adjusted Rand Index (ARI): 0.009729493678839164
HCA silhouette_score: 0.06667274333941256
HCA Adjusted Rand Index (ARI): 0.010731271596755989
| Model | Nº Clusters | Silhouette Score | ARI |
|---|---|---|---|
| K-Means | 2 | 0.098 | 0.003 |
| K-Means | 4 | 0.098 | 0.004 |
| K-Means | 8 | 0.104 | 0.009 |
| HCA | 2 | 0.126 | 0.0007 |
| HCA | 4 | 0.120 | 0.003 |
| HCA | 8 | 0.066 | 0.010 |
As observed, the silhouette scores are close to 0, indicating that the clusters overlap and are poorly separated. The ARI values are also close to 0, showing that the clusterings agree with the adoption-speed labels almost no better than a random assignment, likely due to the overlap visible in the most significant components.
To determine whether the clusters grouped animals with comparable adoption speeds, heatmaps of cluster membership against adoption speed were generated. Across all heatmaps it was evident that every cluster contained a wide range of adoption speeds, showing that no clear separation by speed exists. Consequently, plots of the clusters over the first two principal components were created to investigate this further.
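The heatmaps were produced with helper code not shown in the notebook. A hedged sketch of how such a cluster-versus-speed heatmap can be built, assuming table_y_Scaled holds the adoption-speed labels as in the OurKmeans calls above:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
# Hypothetical sketch: cross-tabulate cluster membership against adoption speed.
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(table_X_Scaled)
counts = pd.crosstab(pd.Series(labels, name="Cluster"),
                     pd.Series(table_y_Scaled, name="AdoptionSpeed"))
sns.heatmap(counts, annot=True, fmt="d", cmap="Blues")
plt.title("Cluster membership vs. adoption speed")
plt.show()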
Next, biclustering appeared to be an interesting avenue to explore. Using PCA, visualizations were created for both the K-Means and the Co-Clustering solutions. Visually, the solutions with 2 or 4 groups seem coherent, whereas the 8-cluster scenario is a clear example of overfitting, as can be verified in the images below.
Biclustering¶
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans, SpectralCoclustering
import seaborn as sns
import matplotlib.pyplot as plt
# Reload the data and its scaled version
table_X, table_y, features, target_name, df = load_data('PetFinder_dataset.csv')
table_X_Scaled, table_y_Scaled, features_Scaled, target_name_Scaled, df_Scaled = loadScaledData(df)
# Export the scaled data for manual inspection
df_Scaled.to_excel("output_file.xlsx", index=False)
def plot_pca_with_clusters(X_scaled, num_clusters, df, ax):
    # Fit K-Means on the scaled data; n_init and random_state are set explicitly
    # (this silences scikit-learn's FutureWarning about the default n_init and makes runs reproducible)
    kmeans = KMeans(n_clusters=num_clusters, n_init=10, random_state=0)
    kmeans.fit(X_scaled)
    cluster_labels = kmeans.labels_
    # Project the data onto the first two principal components for visualization
    pca = PCA(n_components=2)
    pca_result = pca.fit_transform(X_scaled)
    sns.scatterplot(x=pca_result[:, 0], y=pca_result[:, 1], hue=cluster_labels,
                    palette="Set1", s=10, edgecolor='black', ax=ax)
    ax.set_title(f"PCA of KMeans Clustering (Clusters: {num_clusters})")
    ax.set_xlabel("PCA Component 1")
    ax.set_ylabel("PCA Component 2")
    ax.legend(title="Cluster")
def plotBicluster(table_X, nclusters, ax):
    # Spectral Co-clustering groups rows (animals) and columns (features) simultaneously
    clustering = SpectralCoclustering(n_clusters=nclusters, random_state=0)
    clustering.fit(table_X)
    row_labels = clustering.row_labels_
    # Project the rows onto the first two principal components, coloured by row cluster
    pca = PCA(n_components=2)
    table_X_PCA = pca.fit_transform(table_X)
    sns.scatterplot(x=table_X_PCA[:, 0], y=table_X_PCA[:, 1], hue=row_labels,
                    palette="Set1", s=10, edgecolor='black', ax=ax)
    ax.set_title(f"PCA of Spectral Coclustering (Clusters: {nclusters})")
    ax.set_xlabel("PCA Component 1")
    ax.set_ylabel("PCA Component 2")
    ax.legend(title="Cluster")
for num_clusters in [2, 4, 8]:
print(f"\nNumber of Clusters: {num_clusters}")
fig, axes = plt.subplots(1, 2, figsize=(16, 6))
plot_pca_with_clusters(table_X_Scaled, num_clusters, df, axes[0])
plotBicluster(table_X_Scaled, num_clusters, axes[1])
plt.tight_layout()
plt.show()
Number of Clusters: 2
Number of Clusters: 4
Number of Clusters: 8
Overfitting can be observed through the significant overlap of clusters in the images above, both for clustering and biclustering. This indicates that no clearly defined groups exist when considering the two principal components. In K-Means, the overlap becomes evident as soon as the number of clusters reaches 4. In Co-Clustering, however, the groups exhibit less overlap, resulting in a clearer visualization.
That said, the visualization with 2 clusters is better in K-Means, as points that are closer together generally tend to be more similar.
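One additional check, not part of the original analysis, that helps interpret these 2-D plots is how much variance the first two components actually capture; if the value is low, overlap in the projection does not necessarily mean the clusters overlap in the full feature space.
from sklearn.decomposition import PCA
# How much of the total variance do the first two principal components explain?
pca = PCA(n_components=2)
pca.fit(table_X_Scaled)
print("Explained variance ratio:", pca.explained_variance_ratio_,
      "| total:", pca.explained_variance_ratio_.sum())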
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.cluster import SpectralCoclustering
def plotBiclusterHeatmap(table_X, nclusters, ax1):
    # Fit Spectral Co-clustering and reorder rows/columns so that the biclusters appear as blocks
    clustering = SpectralCoclustering(n_clusters=nclusters, random_state=0)
    clustering.fit(table_X)
    row_order = np.argsort(clustering.row_labels_)
    col_order = np.argsort(clustering.column_labels_)
    table_X_reordered = table_X.iloc[row_order, col_order]
    sns.heatmap(table_X_reordered, cmap='twilight_shifted', ax=ax1, cbar=True)
    ax1.set_title(f"Bicluster Heatmap (Clusters: {nclusters})")
    ax1.set_xlabel("Reordered Columns")
    ax1.set_ylabel("Reordered Rows")
fig, ax1 = plt.subplots(1, 1, figsize=(10, 8))
plotBiclusterHeatmap(table_X=df_Scaled, nclusters=5, ax1=ax1)
plt.tight_layout()
plt.show()
The best biclustering was obtained with 5 clusters, as it produced 5 well-defined groups, each containing a specific set of features:
- Cluster 1 included the features: Dewormed, Vaccinated, and Sterilized.
- Cluster 2 contained the features: Gender, PhotoAmt, and Quantity.
- Cluster 3 included: Breed1, MaturitySize, FurLength, and Health.
- Cluster 4 contained: Type, Color2, and State.
- Cluster 5 included: Age, Color1, and Fee.
We believe Cluster 1 makes sense, as all of its features pertain to medical procedures. Cluster 2, which includes PhotoAmt, Gender, and Quantity, also seems logical, as all these features relate to the profile's visual data. Cluster 3 groups Breed1 with MaturitySize and FurLength, which are normally determined by the breed, while Health may also influence maturity and fur length. Cluster 4 is more challenging to interpret, as it is harder to find a clear relation between its features. Cluster 5, however, appears reasonable if age and colour are somewhat correlated and the Fee reflects them as well, since older animals often do not require a fee.
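The feature-to-cluster assignment listed above can be read directly from a fitted co-clustering model. A small sketch of how this can be done, assuming the same df_Scaled frame used for the heatmap:
from sklearn.cluster import SpectralCoclustering
# List which features fall into each column cluster of the 5-cluster solution.
clustering = SpectralCoclustering(n_clusters=5, random_state=0)
clustering.fit(df_Scaled)
for c in range(5):
    cols = df_Scaled.columns[clustering.column_labels_ == c]
    print(f"Column cluster {c + 1}: {list(cols)}")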
2.3. Clustering - Final Discussion and Conclusions¶
In general, when using K-Means and HCA, our results did not indicate clusters with similar adoption times. Instead, the clusters exhibited highly heterogeneous adoption times, making it difficult to draw clear conclusions. When plotting the data in the two principal components, we observed significant overlapping of data points, which suggests that the results may be influenced by this overlap. This could have impacted the ability of the clustering algorithms to distinguish distinct patterns in adoption times.
Our results also indicated weak structure within the clusters, which appeared close to random and very similar to one another.
Given these challenges, we decided to explore biclustering as an alternative approach. The goal was to see if we could form better-defined clusters, identify the features associated with each cluster, and assess whether these groupings made sense in the context of adoption time.
The biclusters obtained generally made sense to us, as each one grouped related features, which suggests that biclustering is a promising method for segmenting this data. This approach yields more meaningful groupings even without focusing on adoption speed, and may provide a better starting point for uncovering patterns related to adoption times.
3. Final Comments and Conclusions¶
In the first part of the project we achieved good results when attempting to predict whether a pet would be adopted (or not).
In the multiclass prediction of adoption speed, we obtained around 30% accuracy, which aligns with expectations, although we know the models could perform better when using fewer classes. Creating a model with good performance is a challenge in multiclass prediction.
A major contributing factor is the limited amount of information relative to the task (16 features for roughly 5000 rows), together with the imbalance between classes. To address the class imbalance, we applied SMOTE to oversample the minority classes, which avoids removing data and preserves the variance in the dataset. These results suggest there is room for improvement by adjusting the features, possibly with one-hot encoding, to further enhance the preprocessing step.
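The SMOTE step itself was applied during preprocessing with code not repeated here. A hedged sketch of how it can be used on the training split only, with imblearn, reusing the table_X and table_y names from load_data:
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split
# Hypothetical sketch: oversample minority classes on the training split only,
# so the test set keeps the original class distribution.
X_train, X_test, y_train, y_test = train_test_split(
    table_X, table_y, test_size=0.3, random_state=0, stratify=table_y)
X_res, y_res = SMOTE(random_state=0).fit_resample(X_train, y_train)
print("Before:", Counter(y_train), "After:", Counter(y_res))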
Overall, the best results were consistently achieved with KNN and Random Forest. This outcome is consistent with the nature of the data, as the data points share similarities, and both models are effective at identifying and leveraging these similarities for classification. Given the characteristics of the data, these models are better suited to predicting whether an animal will be adopted than to predicting how quickly it will be adopted.
In terms of clustering, the animals were not effectively grouped by adoption speed, as verified in the analysis, and the overall clustering quality was poor. This indicates that these clustering methods may not be ideal for this dataset, and further adjustments or alternative approaches may be required to improve the results.